---
title: "Sleep Disorder Prediction"
author: "Greg Pologruto"
output:
  flexdashboard::flex_dashboard:
    theme:
      version: 4
      bootswatch: lumen
    orientation: columns
    vertical_layout: fill
    source_code: embed
---
<style>
.chart-title { /* chart_title */
font-size: 16px;
}
body{ /* Normal */
font-size: 16px;
}
</style>
```{r setup, include=FALSE}
library(flexdashboard)
library(tidyverse)
# global theme and colors (This code is from chatgpt)
theme_set(theme_minimal(base_size = 14))
okabe <- c("#0072B2", "#E69F00", "#F0E442",
"#009E73", "#56B4E9", "#D55E00", "#CC79A7")
scale_fill_discrete <- function(...) scale_fill_manual(values = okabe)
scale_color_discrete <- function(...) scale_color_manual(values = okabe)
```
Data Insight
===
Column {data-width=500}
---
### Predicting Sleep Disorders from Health Indicators
This project examines sleep disorders using data from the *Sleep Health and Lifestyle* dataset to better understand which health factors are associated with sleep disorders. The dataset contains 374 observations and includes variables such as gender, sleep duration, and physical activity level. The response variable classifies individuals into three categories: no sleep disorder, insomnia, and sleep apnea. The primary goal of this study is to evaluate how well a multinomial logistic regression model can classify sleep disorder type based on the available predictors. In addition, the analysis investigates which predictors contribute most strongly to the model, how the model performs on unseen data, and whether certain variables are more predictive of one sleep disorder than the other. Exploratory data analysis, model fitting, and validation using a train/test split were conducted to identify key patterns and evaluate predictive accuracy. The results suggest ... These findings contribute to understanding how behavioral and physiological factors relate to sleep health.
Column {.tabset data-width=500}
---
### Research Questions
#### Main Question
* How well can a multinomial logistic regression model classify whether a patient has no sleep disorder, insomnia, or sleep apnea based on demographic, lifestyle, and physiological variables?
Within that question, we will also look at:
* Which variables contribute most strongly to predicting sleep disorder type in the model?
* How well does the model perform at predicting sleep disorder type on unseen data?
* Are certain variables more indicative of one sleep disorder than another? (e.g., high stress may be a bigger flag for insomnia than for sleep apnea)
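One way these questions could be approached is sketched here. It is a minimal, hypothetical example, not the model fitted in this project: the toy data are simulated, and the coefficients, the 70/30 split, and the choice of two predictors are all assumptions. It fits a multinomial logistic regression with `nnet::multinom()` and measures accuracy on held-out rows. The block is static, so it is not evaluated when the dashboard is knit.

```r
# Hypothetical toy data standing in for the cleaned `sleep` data frame;
# every value below is simulated, not taken from the real dataset.
library(nnet)  # recommended package that ships with standard R installs

set.seed(42)
n <- 300
classes <- c("None", "Insomnia", "Sleep Apnea")
toy <- data.frame(
  stress_level   = sample(3:8, n, replace = TRUE),
  sleep_duration = round(runif(n, 5.8, 8.5), 1)
)

# Linear predictors for each class ("None" is the baseline at 0);
# the coefficients are arbitrary assumptions for illustration.
eta <- cbind(
  0,
  -1 + 0.9 * (toy$stress_level - 5) - 1.2 * (toy$sleep_duration - 7),
  -1 + 0.5 * (toy$stress_level - 5) - 0.6 * (toy$sleep_duration - 7)
)
prob <- exp(eta) / rowSums(exp(eta))  # softmax over the three classes
toy$sleep_disorder <- factor(
  apply(prob, 1, function(p) sample(classes, 1, prob = p)),
  levels = classes
)

# 70/30 train/test split (the ratio is an assumption)
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Fit on the training rows only
fit <- multinom(sleep_disorder ~ stress_level + sleep_duration,
                data = train, trace = FALSE)

# Classify the held-out rows and compute overall accuracy
pred <- predict(fit, newdata = test, type = "class")
accuracy <- mean(as.character(pred) == as.character(test$sleep_disorder))
accuracy
```

With the real data, `toy` would be replaced by the cleaned `sleep` data frame and the formula extended to the predictors of interest.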
### Get to know the data
I conducted an exploratory data analysis of a Sleep Disorder Diagnosis dataset. The dataset contains 374 observations and 14 variables, collecting information about sleep, lifestyle, and health.
```{r package_data, warning= F, fig.height=7}
pacman::p_load(knitr, tidyverse, readr, janitor,DT)
sleep <- read_csv("Sleep_health_and_lifestyle_dataset.csv")
# rename
sleep <- clean_names(sleep)
# change variable types
sleep <- sleep |>
mutate(
gender = factor(gender, levels = c("Male", "Female")),
bmi_category = case_when(
bmi_category == "Normal Weight" ~ "Normal",
TRUE ~ bmi_category
),
bmi_category = factor(bmi_category, levels=c("Normal","Overweight","Obese")),
sleep_disorder = factor(sleep_disorder, levels=c("None", "Insomnia", "Sleep Apnea")),
occupation = factor(occupation)
)|>
# split blood_pressure into two separate numbers (120/80 -> s: 120, d: 80)
separate(blood_pressure, into = c("bp_systolic", "bp_diastolic"),
sep = "/", remove = TRUE, convert = TRUE)
datatable(sleep, class = 'cell-border stripe')
```
### Cleaning Data
I got the dataset from [Kaggle](https://www.kaggle.com/datasets/mdsultanulislamovi/sleep-disorder-diagnosis-dataset/data). It was simple to clean. There were no NA values in the dataset, so I only had to convert gender, bmi_category, sleep_disorder, and occupation to factors. I also split blood_pressure into two numeric variables, bp_systolic and bp_diastolic, each holding just the number (e.g., 120/80 becomes 120 and 80). The bmi_category variable contained both a "Normal" and a "Normal Weight" category, which I merged into a single "Normal" category.
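The cleaning steps can be checked on a tiny example. This is a minimal base R sketch using made-up rows (the data frame `raw` and its values are hypothetical); it mirrors the recoding and splitting logic without the tidyverse so each step is easy to inspect. It is a static block and is not run when the dashboard is knit.

```r
# Hypothetical rows mimicking two raw columns after clean_names();
# the values are invented for illustration only.
raw <- data.frame(
  bmi_category   = c("Normal", "Normal Weight", "Overweight", "Obese"),
  blood_pressure = c("120/80", "125/82", "130/85", "140/90"),
  stringsAsFactors = FALSE
)

# Count missing values per column before recoding
na_per_column <- colSums(is.na(raw))

# Merge the two "normal" labels, then fix the factor level order
raw$bmi_category[raw$bmi_category == "Normal Weight"] <- "Normal"
raw$bmi_category <- factor(raw$bmi_category,
                           levels = c("Normal", "Overweight", "Obese"))

# Split "120/80" into two integer columns and drop the original string
bp <- do.call(rbind, strsplit(raw$blood_pressure, "/", fixed = TRUE))
raw$bp_systolic    <- as.integer(bp[, 1])
raw$bp_diastolic   <- as.integer(bp[, 2])
raw$blood_pressure <- NULL

table(raw$bmi_category)
str(raw)
```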
EDA
===
Column {data-width=500}
---
### Background/Significance
As a college student, I am well aware of the importance of sleep. Sleep is a fundamental component of human health. Rising stress levels, demanding work schedules, and a decrease in overall physical health have contributed to widespread sleep problems in society. Understanding the factors that influence sleep quality is essential for identifying characteristics that may put people at risk for sleep disorders.
This dataset provides an opportunity to explore the relationships between physical/mental health and sleep. We will investigate how different aspects of daily life relate to sleep behaviors. By analyzing the data, we can build a multinomial logistic regression model to predict which kind of sleep disorder, if any, a person has based on health indicators.
### Methods
Column {.tabset data-width=500}
---
### Diagnosed Sleep Disorders
```{r disorder_bar, fig.cap="Distribution of diagnoses in the dataset. None is the most common with 219 occurrences, followed by Sleep Apnea with 78 and Insomnia with 77."}
ggplot(sleep,
aes(x=sleep_disorder))+
geom_bar(color="black", fill="#0072B2")+
labs(
title="Distribution of Diagnoses",
x = "Sleep Disorder"
)+
geom_text(stat="count",
aes(label = after_stat(count)),
vjust = -0.4)+
theme_minimal()+
theme(legend.position ="none")
```
### Sleep Duration
```{r sleep_duration_hist, fig.cap="This histogram shows the distribution of the number of hours slept per day by the patients. Sleep duration is roughly symmetric and multimodal, centered around 7.25 hours and ranging from 5.75 to 8.5 hours. There appear to be clusters in the data: four distinct groups can be seen in the histogram."}
ggplot(sleep,
aes(x=sleep_duration))+
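# bins = 30 is the ggplot2 default; stating it explicitly silences the stat_bin() message
geom_histogram(bins = 30, fill = "#0072B2", color="black")+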
labs(
title="Distribution of Sleep Duration",
x="Sleep Duration (Hours)"
)+
scale_x_continuous(breaks = seq(5, 10, by = 0.25))+
theme_minimal()
```
### Sleep Duration by Sleep Disorder
```{r sleep_duration_by_disorder, fig.cap="As expected, people without a sleep disorder have a higher median sleep duration than those with insomnia or sleep apnea. People with sleep apnea have a higher median than people with insomnia. The insomnia group has both high and low outliers, and the no-disorder group has one low outlier. These boxplots suggest a relationship between sleep duration and sleep disorder."}
ggplot(sleep,
aes(y=sleep_duration, x=sleep_disorder, fill = sleep_disorder))+
geom_boxplot(color="black")+
labs(
title = "Sleep Duration Boxplot by Sleep Disorder",
x = "Sleep Disorder",
y = "Sleep Duration (Hours)"
)+
theme_minimal()+
theme(legend.position = "none")
```
### Stress Level
```{r stress_level_bar, fig.cap="Stress level is a subjective rating from 1 (low) to 10 (high), and this graph shows its distribution. The most common response is 3, with 71 responses. It is also noteworthy that no one reported extremely low stress (1 or 2) or extremely high stress (9 or 10)."}
ggplot(sleep,
aes(x=as.factor(stress_level)))+
geom_bar(color="black",fill="#0072B2")+
labs(
title="Distribution of Stress Levels",
x= "Stress Levels (1-10)"
)+
geom_text(stat="count",
aes(label = after_stat(count)),
vjust = -0.4)+
theme_minimal()+
theme(legend.position = "none")
```
### Stress Level by Sleep Disorder
```{r stress_level_by_sleep_disorder, fig.cap="People without a sleep disorder have a lower median stress level than those with insomnia or sleep apnea, both of which have a median of 7. This boxplot suggests a relationship between stress level and sleep disorder."}
ggplot(sleep,
aes(y=stress_level, x=sleep_disorder, fill = sleep_disorder))+
geom_boxplot(color="black")+
labs(
title = "Stress Level Boxplot by Sleep Disorder",
x = "Sleep Disorder",
y = "Stress Level"
)+
theme_minimal()+
theme(legend.position = "none")
```
### Physical Activity by Sleep Disorder
```{r physical_activity_sleep_disorder, fig.cap="Interestingly, people with sleep apnea have a higher median physical activity level than those without a sleep disorder. People with insomnia have the lowest median. There are high outliers in the insomnia group and low outliers in both the insomnia and sleep apnea groups. This boxplot suggests a relationship between physical activity and sleep disorder."}
ggplot(sleep,
aes(x=sleep_disorder, y=physical_activity_level, fill=sleep_disorder))+
geom_boxplot(color="black")+
labs(
title = "Physical Activity Boxplot by Sleep Disorder",
x = "Sleep Disorder",
y = "Physical Activity"
)+
theme_minimal()+
theme(legend.position = "none")
```
### Heart Rate by Sleep Disorder
```{r heart_rate_sleep_disorder, fig.cap="This boxplot shows the distribution of heart rate by sleep disorder category. People without a sleep disorder have a lower median heart rate than those with sleep apnea or insomnia. Patients with insomnia have the highest median heart rate. In all categories there appear to be some high outliers."}
ggplot(sleep,
aes(x=sleep_disorder, y=heart_rate, fill=sleep_disorder))+
geom_boxplot(color="black")+
labs(
title = "Heart Rate Boxplot by Sleep Disorder",
x = "Sleep Disorder",
y = "Heart Rate"
)+
theme_minimal()+
theme(legend.position = "none")
```
### Sleep Disorder by BMI
```{r bmi_sleep_disorder, fig.cap= "This stacked bar chart is very telling. It shows that a large majority of people without a sleep disorder have a normal BMI, while the majority in both the sleep apnea and insomnia groups are overweight. There are also no obese people in the data without a sleep disorder. This points to a clear relationship between BMI and sleep disorder."}
ggplot(sleep,
aes(x = sleep_disorder, fill = bmi_category))+
geom_bar(position = "fill", color="black") +
labs(
title = "Sleep Disorder Barchart by BMI Category",
x = "Sleep Disorder",
y = "Proportion"
) +
theme_minimal()
```
### Occupation by Sleep Disorder
```{r occupation_sleep_disorder, fig.width=8, fig.cap="This stacked bar chart shows the proportion of each sleep disorder within each occupation. The proportions clearly differ from job to job, but it is hard to determine from this graph alone whether the variable will be useful in the model."}
# To display properly split into two graphs
occ_levels <- levels(sleep$occupation)
first6 <- occ_levels[1:6]
last5 <- occ_levels[7:11]
sleep1 <- sleep |> filter(occupation %in% first6)
sleep2 <- sleep |> filter(occupation %in% last5)
p1 <- ggplot(sleep1,
aes(x = occupation, fill = sleep_disorder))+
geom_bar(position = "fill", color="black") +
labs(
title = "Occupation Barchart by Sleep Disorder",
x = "Occupation",
y = "Proportion"
) +
theme_minimal()
p2 <- ggplot(sleep2,
aes(x = occupation, fill = sleep_disorder))+
geom_bar(position = "fill", color="black") +
labs(
x = "Occupation",
y = "Proportion"
) +
theme_minimal()+
theme(legend.position = "none")
library(patchwork)
p1/p2
```
### Gender
```{r gender_pie, fig.cap="This pie chart shows the gender of the patients. There are approximately equal numbers of males and females in this study."}
gender_count <- count(sleep, gender)
gender_count$percent <- round(gender_count$n/sum(gender_count$n)*100,2)
ggplot(gender_count,
aes(x="", y=percent, fill=gender))+
geom_bar(stat='identity', width =1, color='black')+
coord_polar("y",start=0)+
geom_text(aes(label=paste0(percent,"%")),
fontface="bold",
color="black",
position = position_stack(vjust=0.7))+
labs(title="Pie Chart of Gender")+
scale_fill_manual(values = c("Female" = "pink", "Male" = "lightblue"))+
theme_void()
```
### Pairwise Plot
```{r pairwise_plot, fig.cap="From this plot, there appear to be roughly linear relationships between sleep duration and physical activity, heart rate and daily steps, and heart rate and sleep duration. Points are colored by sleep disorder."}
pairs(sleep[,c("age","sleep_duration","physical_activity_level","heart_rate","daily_steps")],
col = sleep$sleep_disorder)
```
Multinomial Logistic Model
===
Column {.tabset data-width=500}
---
### First Model
### First Model Performance
### Assumptions
Before interpreting the multinomial logistic regression model, I checked several important assumptions. The observations in the dataset were independent, the predictors did not show extreme multicollinearity, and the variables included were appropriate for the relationships being modeled. Checking these assumptions helped ensure that the model’s results are trustworthy and that the conclusions drawn from the analysis are based on a stable and reliable model.
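The multicollinearity check mentioned here can be illustrated with a small sketch. This is a hypothetical example on simulated predictors (the column names mirror the dataset, but the values and the strength of the relationships are assumptions), and it is not necessarily how the check was carried out in this project. It computes pairwise correlations and variance inflation factors (VIF) by hand in base R. It is a static block and is not run when the dashboard is knit.

```r
# Simulated numeric predictors standing in for columns of `sleep`;
# names mirror the dataset but the values are not real.
set.seed(1)
n <- 200
toy <- data.frame(age = round(runif(n, 27, 59)))
toy$sleep_duration          <- 7.2 + rnorm(n, sd = 0.8)
toy$physical_activity_level <- 60 + 10 * (toy$sleep_duration - 7.2) + rnorm(n, sd = 15)
toy$heart_rate              <- 70 - 2 * (toy$sleep_duration - 7.2) + rnorm(n, sd = 3)

# Pairwise correlations: values near +1 or -1 would flag trouble
round(cor(toy), 2)

# VIF for each predictor: 1 / (1 - R^2), where R^2 comes from
# regressing that predictor on all of the other predictors
vif <- sapply(names(toy), function(v) {
  others <- setdiff(names(toy), v)
  r2 <- summary(lm(reformulate(others, response = v), data = toy))$r.squared
  1 / (1 - r2)
})
round(vif, 2)
```

A common rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic multicollinearity, though the threshold is a convention rather than a strict cutoff.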
Column {.tabset data-width=500}
---
### Second Model
### Second Model Performance
### Best Model (Interpret)
* Interpret coefficients (log-odds, odds ratios) and baseline category.
* Evaluate goodness-of-fit (deviance, likelihood ratio test, pseudo-R²).
* Assess predictive performance (confusion matrix, accuracy, per-class sensitivity/specificity, ROC & AUC).
* Write a short Results section that reports the model transparently.
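Several of these steps can be sketched in a few lines. The example here is hypothetical and is not the project's final model: the data are simulated, the single predictor and its coefficients are assumptions, and ROC/AUC is left out. It shows odds ratios relative to the baseline category, a likelihood ratio test against an intercept-only model, McFadden's pseudo-R², and a confusion matrix with per-class sensitivity and specificity, all using `nnet::multinom()` and base R. It is a static block and is not run when the dashboard is knit.

```r
library(nnet)  # recommended package that ships with standard R installs

# Simulated data; every value and coefficient here is an assumption.
set.seed(7)
n <- 400
classes <- c("None", "Insomnia", "Sleep Apnea")
toy <- data.frame(stress_level = sample(3:8, n, replace = TRUE))
eta <- cbind(0,
             -1 + 0.8 * (toy$stress_level - 5),
             -1 + 0.4 * (toy$stress_level - 5))
prob <- exp(eta) / rowSums(exp(eta))
toy$sleep_disorder <- factor(
  apply(prob, 1, function(p) sample(classes, 1, prob = p)),
  levels = classes  # the first level, "None", is the baseline category
)

fit  <- multinom(sleep_disorder ~ stress_level, data = toy, trace = FALSE)
null <- multinom(sleep_disorder ~ 1, data = toy, trace = FALSE)

# Odds ratios relative to the baseline: exp() of the log-odds coefficients
odds_ratios <- exp(coef(fit))

# Likelihood ratio test against the intercept-only model
lr_stat <- deviance(null) - deviance(fit)
lr_df   <- fit$edf - null$edf
lr_p    <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# McFadden's pseudo-R^2
pseudo_r2 <- 1 - as.numeric(logLik(fit)) / as.numeric(logLik(null))

# Confusion matrix (rows = actual, columns = predicted) and accuracy
pred <- factor(predict(fit, type = "class"), levels = classes)
conf_mat <- table(actual = toy$sleep_disorder, predicted = pred)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# Per-class sensitivity: correctly predicted share of each actual class
sensitivity <- diag(conf_mat) / rowSums(conf_mat)

# Per-class specificity: true negatives / (true negatives + false positives)
specificity <- sapply(seq_along(classes), function(k) {
  tn <- sum(conf_mat[-k, -k])
  fp <- sum(conf_mat[-k, k])
  tn / (tn + fp)
})
names(specificity) <- classes
```

These metrics are computed on the same rows used for fitting, so they are optimistic; for an honest estimate, `predict()` would be called with held-out rows via `newdata`.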
Discussion
===
Column {data-width=500}
---
### Results
Column {.tabset data-width=500}
---
### Discussion and Limitations
About Me
===
Column {.tabset data-width=700}
---
### Who Am I?
Hello, my name is Greg Pologruto. I am an Honors Computer Science student at the University of Dayton with minors in Data Analytics and Mathematics. I am on track to graduate in May 2027.
I have experience working for FedEx as an information technology intern. During my time there, I worked with big data (millions of packages) to create a dashboard of key performance indicators that were critical to the business and to my agile team.
I have skills in Python, R, and Power BI. I have relevant projects in data visualization, data science, and machine learning. My goal is to acquire a data science internship in the technology field this summer.
Contact Me:
pologrutog1@udayton.edu | [LinkedIn](https://www.linkedin.com/in/greg-pologruto) | [GitHub](https://github.com/greg-pologruto)
### References
#### Sources
#### AI Usage
AI was used in this project to help create color themes, format the dashboard, and debug some code.
Column {data-width=300}
---
### Picture of Me
```{r picture, echo = F, fig.cap = "Me (left) and my roommate Logan showing our Christmas spirit.", out.width = '100%'}
knitr::include_graphics("C:/Users/gregp/OneDrive/Pictures/Archive1/af9ae276-2198-4d1c-92d0-d3a65a3ca96d.jpg")
```